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ABSTRACT 

Motivation: Time-course gene expression data such as yeast 
cell cycle data may be periodically expressed. To cluster such 
data, currently used Fourier series approximations of periodic gene 
expressions have been found not to be sufficiently adequate to model 
the complexity of the time-course data, partly due to their ignoring 
the dependence between the expression measurements over time 
and the correlation among gene expression profiles. We further 
investigate the advantages and limitations of available models in 
the literature and propose a new mixture model with AR(1) random 
effects for the clustering of time-course gene-expression profiles. 
Some simulations and real examples are given to demonstrate the 
usefulness of the proposed models. 

Results: We illustrate the applicability of our new model using 
synthetic and real time-course datasets. We show that our model 
outperforms existing models to provide more reliable and robust 
clustering of time-course data. Our model provides superior results 
when genetic profiles are correlated. It also gives comparable results 
when the correlation between the gene profiles is weak. In the 
applications to real time-course data, relevant clusters of co-regulated 
genes are obtained, which are supported by gene-function annotation 
databases. 

Availability: An R-program is available on request from the 

corresponding author. 

Contact: g.mclachlan@uq.edu.au| 

Supplementary Information: http : / /www. maths . uq.edu . au/ 
~g jm/bioinf_2 011_supp . pdf . 



■ 1 INTRODUCTION 



DNA microarray analysis has emerged as a leading technology 
to enhance our understanding of gene regulation and function in 
cellular mechanism controls on a genomic scale. This technology 
has advanced to unravel the genetic machinery of biological 
rhythms by collecting massive gene-expression data in a time 
cours e. Time-course gene expression data such as yeast cell cycle 
data jWichert et ali i2004) appear to be periodically expressed. 
To associate the profile of gene expression with a physiological 
function of interest, it is crucial to cluster the types of gene 
expression on the basis of their periodic patterns. The identification 
of co-expressed genes also fac ilitates the prediction of response to 
treatment or toxic compounds jHafemeister et a/.ll201 ih . Statistical 
modelling and algorithms play a central role in cataloguing dynamic 
gene-expression profiles. 

*to wliom correspondence sliould be addressed 



Various computational models have been dev eloped for gene 
clustering based on cross-sectional microarray d ata jMcLachlan et al 
l2002l ; lRamoni et a/.L[20o3 ; lFan and Renll2006l) . Also, considerable 
attention has been paid to methodological derivations for detecting 
temporal patterns of gene expression in a time course based 
on functional principal component analysis or mixture model 
analysis |Qi n _^nd Self, 2006; Xu a/., 2002; Luan and Li, 2003 
Luanandll |2004|; Istorev ef fl/.L l2005i; iHo ng and Li. l200e 



Ma et alV |2006| ; iNg et ali |2006| ; \ Kim et al. [ Booa ; .Booth et al. 
20081) , including the applica tions t o identify diffe rentially expressed 



genes over time jPark et fl/l .'2003': Sun and Wei, 201 1"). 

Finite mixture models (McLachlan and Peel, 2000) have been 
widely used to model the distributions of a variety of random 
phenomena. Multivariate normality is generally assumed for 
multivariate data of a continuous nature. The multivariate normal 
mixture model is employed to detect different patterns in gene- 
expression profiles. However, when the two assumptions that are 
commonly adopted in practice, namely, 

(1) there are no replications on any particular entity specifically 
identified as such and 

(2) all the observations on the entities are independent of one 
another, 

are violated, multivariate normal mixture models may not be 
adequate. For example, condition (2) will not hold for the 
clustering of gene profiles, since not all the genes are independently 
distributed, and condition (1) will generally not hold either as the 
gene profiles may be measured over time or on technical replicates. 
While this correlated structure can be incorporated into the normal 
mixture model by appropriate specification of the component- 
covariance matrices, it is difficult to fit the model under such 
specifications. For example, the M-step may not exist in closed form 
jMcLachlan et fl/ll2004). 

Accordingly. iNg et al\ j2006l) have developed the procedure 
called EMMIX-WIRE (EM-based Mixture analysisWith Random 
Effects) to handle the clustering of correlated data that may be 
replicated. They adopted a mixture of linear mixed models to 
specify the correlation structure between the variables and to allow 
for correlations among the observations. It also enables covariate 
information to be incorporated into the clustering process iNg et all 
[200®. Proceeding con ditionally on th e tissue-specific random 
effects as foiTnulated in iNg et ali 1 I2OO6I) , the E- and M-steps can 
be implemented in closed form. In particular, an approximation to 
the E-step by carrying out time-consuming Monte Carlo methods is 
not required. A probabilistic or an outright clustering of the genes 
into g components can be obtained, based on the estimated posterior 
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probabilities of component membership given the profil e vectors 
and th e estimated tissue-specific random effects; see iNg et al\ 

Fourier series approximations have been used to model periodic 
gene expression, leading to the detection of periodic sign als in 
vario u s organisms including yeast and human cells ( Spellman et all 
1 19981 : IWichert et a/l |2004 iKim at all l2006h . If the genes studied 
are periodically regulated, their time-dependent expression can 
be accurately approxi mated by a Fourier series approximation 
jSpellman et a/1 [l998h . A general form of the fcth order Fourier 



series expansion is given as 



9k{t) 



■ ao 



+ '^^[ajCOs(2njt /oj) + bjsm{2njt/u 



(1) 



where ao is the average value of gk{t)- The other coefficients and 
hk are the amplitude coefficients that determine the times at which 
the gene achieves peak and trough expression levels, respectively, 
and cj is the period of the signal of gene expression. While the time- 
dependent expression value of a gene can be adequately modelled by 
a Fou rier series approximation of the first three orders dKim et cdX. 
l2008h . recent resuhs (Kim et al, 2008; Ng et al, 2006) demonstrate 
that the first-order Fourier series approximation is sufficient to 
provide good results in terms of clustering the time-course data 
into meaningful functional groups. Alternatively, the likelihood 
ratio test may be used to determine the order of the Fourier series 
approximation within the nested re gression models. 



The EMMIX-WIRE model of iNg et al\ ( l2006h is developed 
primarily for clustering gene s from gene ral microarray experimental 
designs. On the other hand, iKim et al\ |2008j) focus specifically on 
clustering periodic gene profiles and propose a special covariance 
structure to incorporate the correlation between observations at 
different time points. They als o review curren t methods and 
compare the i r meth od with that o f iNg et al. 1 1 I2OO6I) . More recently, 
IScharl et al\ ilOVj) use integrated autoregressive (AR) models to 
create cluster centers in their simulation study of mixtures of 
regression models for time-course ge ne expr e ssion d ata through the 
new v ersion of software FlexMix of iLeischI | |2004|) . Iwang and FanI 
propose mixtures of multivariate linear mixed models with 
autoregressive errors to analyse longitudinal data. In this paper, we 
propose a new EMMIX-WIRE normal mixture regression model 
with AR(1) random effects for the clustering of time-course data. 
In particular, the model accounts for the correlation among gene 
profiles and models the dependence between expressions over time 
via AR(I) random effects. 

The paper is organized as follow: Section 2 presents the 
development of the extension of the EMMIX-WIRE model to 
incorporate AR(1) random effects which are fitted under the EM 
framework. We conduct a simulation study and the data analysis 
with two real yeast cell data in Section 3. In the last section, some 
discussion is provided. The technical details of the derivations are 
provided in the Supplementary Information. 

2 EMMIX-WIRE MODEL WITH AR(1) RANDOM 
EFFECTS 

We let X denote the design matrix and /? the associated vector of 
regression coefficients for the fixed effects. In the specifi cation of 
the mixture of mixed linear components as adopted by iNg et al\ 



( l2006h . the vector pj for the jth gene conditional on its membership 
of the hth component of the mixture is expressed as 



y ■ = Xf3^ + ZiUjh + Z2Vh + ejh U = 1, 



(2) 



where Ph is a {2k + 1) vector containing unknown parameters 
ao,ai, . . . ,ak,bi, . . . ,bk; see (I), Ujh = (ujhi, ■ ■ ■ ,Ujhm) 
and Vh = {vhi, ■ ■ ■ ,Vhm)'^ are the random effects, where m 
is the number of time points. In ([2J, Zi and Z2 are m x m 
identity matrices. Without loss of generality, we assume ejh 
and Vh to be independent and normally distributed, A'^(0, Q.) 
and N{0,D), independent of Ujh- To further account for the 
time dependent random gene effects, a first-order autoregressive 
correlation structure is adopted for the gene profiles, so that Ujh 
follows a A''(0, d^A{p)) distribution, where 



.-1 . 

1-2 



(3) 



The inverse of A{p) can be expressed as 

Aip)-' = il + p^)I^pJ-p'K, 



and 



trace (^^^^Aip) ] = -2p/(l - p'), 



(4) 



(5) 



where J, J and K are all m x m matrices. Specifically, I is 
the identity matrix; J has its sub-diagonal entries ones and zeros 
elsewhere, and K takes on the value 1 at the first and last element 
of its principal diagonal and zeros elsewhere. The expressions l|4j 
and ^ are needed in the derivation of the maximum likelihood 
estimates of the parameters. 

The assumptions 1I2I1 and (O imply that our new model assumes 
an auto-correlation covariance structure under which measurements 
at each time point have a larger variance compared to the model of 
Kim et al. (2008) under an AR(1) auto-correlation residual structure. 

In the context of mixture models, we consider the ^-component 
mixture with probability density function (pdf) as 

9 

fiV I *) = I Ph,^h:0lAH,Du), (6) 

h=l 

where fh is the component-pdf of the multivariate normal 
distribution with mean vector XhPh and covariance matrix 

elZiAhZ'^ + ZiDhZl + ^h. 

The vector of unknown parameters is denoted by and can be 
estimated by maximum likelihood via the EM algorithm. 

In the EM framework adopted here, the observed data vector y — 



(2/1,2/2, 
labels. 



Zl,Z2, . 



is augmented by the unobservable component 



, 2n of 1/1,2/2, 



where zj is the g- 



dimensional vector with hth element Zjh, which is equal to 1 if 
i/j comes from the hth component of the mixture, and is zero 
otherwise. These unobservable values are considered to be missing 
data and are included in the so-called complete-data vector. Finally, 



we take the random effect vectors Ujh and Vh (j = 



, n; h 
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1, . . . , g), to be missing and include them too in tiie complete-data 
vector. Now the so called complete-data log-likelihood Ic is the sum 
of four terms Ic ~ li + h + h + U, where 

g n 

^1 = XI ^jhlog{ph) (7) 

h = l j = l 

is the logarithm of the probability of the component labels Zjh, and 
where I2 is the logarithm of the density function of y conditional 
on Ujh , Vh, and Zjh=l, and Z3 and I4 is the logarithm of the density 
function of u and v, respectively, given Zjh=l, 

g n 

h = - \ '^^Zjh (mlog{2Tv) + log\nh \ + ejh^h^ejh), (8) 

g n 

h = -i Xim^J'^ {rnlog{2-Kel) + log\Ah\ + e'^^u^^A^^Uju^ , 
h=i j=i 

(9) 

g n 

U = -^Yl C^^i'^'} (mlog{2-K) + log\Dh\ +vlD~^v,)j, 

h = l i = l 

(10) 

where 

<=j'> = V] — — ZlUjh — Z2Vh- 

To maximize the complete-data log likelihood Ic, the above 
decomposition implies that each of li, h, h, and I4, can be 
maximized separately. The EM algorithm proceeds iteratively until 
the difference between successive values of the log likelihood is less 
than some specified threshold. All major derivations are given in the 
Supplementary Information. 

3 SIMULATIONS AND APPLICATIONS 

3.1 Simulation study 

To illustrate the performance of the proposed model, we present 
a simulation study based on synthetic time-course data. In the 
following simulation, we consider an autocorrelation dependence 
for the periodic expressions and compare our model to that of 
iKim et al ] ( l2008h . Synthetic time-course data from three different 
parametric models (the full model under our new extended EMMIX- 
WIRE a pproach (denoted by EM-W in the ta bles), the extende d 
model of bin and Sei3 l l2006l) . and the model of lKim et al\ l l2008h ). 
assuming a first-order Fourier series of periodicity, are considered in 
the simulation study. Within each model, we consider two different 
settings of 6'^ corresponding to low and high auto-correlation among 
the periodic gene expressions. We also assume that SI and D 
are diagonal matrices, where the common diagonal elements are 
represented by and d?, respectively. 

There are three classes of genes. The periods for each class are 
6, 10 and 16, respectively. There are 24 measurements at time 
points 0, 1, 23, and the first order Fourier expansion is adopted 
in the simulation models. Parameters and simulation results are 
listed in Tables 1 to 6. In each table, we summarize the results 
from 1000 simulated sets of data. The true values of the parameters 
and the means of their estimates are given in these tables, along 
with the standard errors in parentheses. We terminated the EM 
algorithm iterations when the absolute values of the relative changes 
in all estimates between consecutive iterations were smaller than 



Table 1. Bias and standai'd deviation in brackets from 1000 simulated data 
points (generated from new EMMIX-WIRE (EM-W) model with e'l equal to 
0.5) 



First component Second component Third component 



Parameters 


EM-W 


Kim 


EM-W 


Kim 


EM-W 


Kim 


p(0.585. 


-0.002 


0.016 


-0.009 


-0.001 


0.011 


-0.015 


0.1,0.315) 


(0.045) 


(0.052) 


(0.033) 


(0.029) 


(0.051) 


(0.051) 


ao(0.3. 


0.002 


0.008 


-0.006 


-0.036 


-0.003 


-0.009 


1,0.2) 


(0.135) 


(0.137) 


(0.175) 


(0.186) 


(0.186) 


(0.182) 


ai(0.03. 


-0.001 


-0.018 


0.024 


0.004 


0.004 


-0.001 


1,0.02) 


(0.119) 


(0.124) 


(0.272) 


(0.160) 


(0.175) 


(0.152) 


61 (0.06, 


0.009 


-0.015 


-0.164 


0.031 


0.027 


0.008 


0.9,0.01) 


(0.119) 


(0.132) 


(0.223) 


(0.160) 


(0.149) 


(0.183) 


6*^(0.5, 


0.055 


1.543 


0.089 


1.346 


0.110 


1.443 


0.5,0.5) 


(0.082) 


(1.547) 


(0.164) 


(1.349) 


(0.152) 


(1.446) 


p{0.6 


-0.023 


-0.395 


-0.043 


-0.372 


-0.043 


-0.392 


0.6,0.6) 


(0.036) 


(0.397) 


(0.082) 


(0.374) 


(0.058) 


(0.394) 


(T^(1.0, 


0.0171 




-0.017 




0.011 




1.0,1.0) 


(0.055) 




(0.127) 




(0.088) 






-0.112 




-0.091 




-0.118 




0.2,0.3) 


(0.145) 




(0.102) 




(0.134) 








EM-W 






Kim 




EiTor rate 




0.036 






0.098 




Rand 




0.954 






0.864 




Adjusted 




0.907 






0.726 





Table 2. Bias and standai'd deviation in brackets from 1000 simulated data 
points (generated from new EMMIX-WIRE (EM-W) model with 6^ equal to 
1.3) 



First component Second component Third component 



Parameters 


EM-W 


Kim 


EM-W 


Kim 


EM-W 


Kim 


p(0.585. 


-0.006 


0.035 


-0.009 


-0.002 


0.015 


-0.033 


0.1,0.315) 


(0.061) 


(0.080) 


(0.047) 


(0.045) 


(0.070) 


(0.074) 


ao(0.3. 


0.001 


0.018 


-0.004 


-0.069 


-0.00 


-0.014 


1,0.2) 


(0.137) 


(0.147) 


(0.173) 


(0.197) 


(0.186) 


(0.178) 


ai(0.03, 


0.010 


-0.062 


0.017 


-0.031 


0.001 


-0.002 


1,0.02) 


(0.162) 


(0.227) 


(0.388) 


(0.236) 


(0.230) 


(0.199) 


f)i(0.06. 


0.009 


-0.042 


-0.180 


0.073 


0.032 


0.009 


0.9,0.01) 


(0.124) 


(0.166) 


(0.235) 


(0.188) 


(0.163) 


(0.213) 


6*2(1.3, 


-0.042 


1.671 


-0.030 


1.449 


0.008 


1.549 


1.3,1.3) 


(0.097) 


(1.677) 


(0.223) 


(1.460) 


(0.153) 


(1.556) 


p(0.6 


0.009 


-0.249 


-0.001 


-0.228 


0.002 


-0.250 


0.6,0.6) 


(0.020) 


(0.251) 


(0.055) 


(0.235) 


(0.025) 


(0.252) 


0-2(1.0, 


0.131 




0.121 




0.141 




1.0,1.0) 


(0.155) 




(0.219) 




(0.186) 






-0.151 




-0.124 




-0.160 




0.2,0.3) 


(0.172) 




(0.129) 




(0.168) 








EM-W 






Kim 




Error rate 




0,094 






0.184 




Rand 




0.881 






0.759 




Adjusted 




0.760 






0.519 
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Table 3. Bias and standard deviation in brackets from 1000 simulated data 
points (generated from new EMMIX-WIRE (EM-W) model with equal to 
0.5 and d? equal to 0) 



Table 4. Bias and standard deviation in brackets from 1000 simulated data 
points (generated from new EMMIX-WIRE (EM-W) model with 6*2 equal to 
1.3 and d? equal to 0) 



First component Second component 



Parameters 


EM-W 


Kim 


EM-W 


Kim 


EM-W 


Kim 


Parameters 


EM-W 


Kim 


EM-W 


Kim 


EM-W 


Kim 


p(0.585. 


0.001 


0.008 


-0.001 


-0.003 


-0.001 


-0.005 


p(0.585, 


-0.001 


0.024 


0.002 


-0.005 


-0.001 


-0.019 


0.1,0.315) 


(0.009) 


(0.012) 


(0.008) 


(0.008) 


(0.010) 


(0.011) 


0.1,0.315) 


(0.014) 


(0.029) 


(0.016) 


(0.017) 


(0.017) 


(0.026) 


ao(0.3. 


0.001 


0.008 


-0.001 


-0.018 


0.003 


-0.014 


ao(0.3, 


-0.001 


0.018 


0.003 


-0.046 


0.000 


-0.005 


1,0.2) 


(0.017) 


(0.019) 


(0.018) 


(0.026) 


(0.016) 


(0.016) 


1,0.2) 


(0.027) 


(0.035) 


(0.026) 


(0.053) 


(0.021) 


(0.021) 


ai(0.03. 


-0.002 


-0.023 


-0.001 


-0.005 


0.003 


-0.006 


ai(0.03. 


0.001 


-0.068 


0.005 


-0.041 


0.001 


0.008 


1 A mi 

1,U.UZ_) 














1 A A0\ 

i,u.uz; 


/A AS^\ 


I'A \ Afi,\ 
(U. 140) 


I'A 1 Afi\ 




I'A AGA1 
(^U.UoD ) 


/A AS^\ 


fei(0.06. 


-0.001 


-0.014 


0.016 


0.019 


0.002 


0.004 


6i(0.06, 


0.003 


-0.031 


0.005 


0.047 


0.002 


0.004 


0.9,0.01) 


(0.026) 


(0.031) 


(0.033) 


(0.038) 


(0.032) 


(0.033) 


0.9,0.01) 


(0.042) 


(0.063) 


(0.054) 


(0.072) 


(0.050) 


(0.054) 


5^(0.5, 


0.071 


1.162 


0.081 


1.158 


0.078 


1.159 


e^(i.3, 


-0.059 


1.254 


-0.076 


1.251 


-0.052 


1.242 


0.5,0.5) 


(0.081) 


(1.162) 


(0.119) 


(1.160) 


(0.090) 


(1.159) 


1.3,1.3) 


(0.087) 


(1.254) 


(0.178) 


(1.257) 


(0.104) 


(1.243) 


p(0.6 


-0.032 


-0.337 


-0.037 


-0.339 


-0.036 


-0.339 


p(0.6 


0.012 


-0.198 


-0.013 


-0.201 


0.009 


-0.203 


0.6,0.6) 


(0.038) 


(0.337) 


(0.062) 


(0.340) 


(0.045) 


(0.340) 


0.6,0.6) 


(0.019) 


(0.199) 


(0.039) 


(0.206) 


(0.023) 


(0.204) 


0-^(1.0, 


-0.059 




-0.069 




-0.064 




0-^(1.0, 


0.046 




0.056 




0.039 




1.0,1.0) 


(0.068) 




(0.106) 




(0.077) 




1.0,1.0) 


(0.070) 




(0.145) 




(0.084) 




d?(0. 







0.001 




0.000 




cP(0., 


0.000 




0.001 




0.000 




0,0) 


(0.000) 




(0.001) 




(0.001) 




0.,0.) 


(0.000) 




(0.001) 




(0.000) 








EM-W 






Kim 








EM-W 






Kim 




Error rate 




0.078 






0.081 




Error rate 




0.154 






0.162 




Rand 




0.891 






0.886 




Rand 




0.796 






0.783 




Adjusted 




0.780 






0.769 




Adjusted 




0.590 






0.566 





0.00001, with the maximum iterati on of 1000. For ou r model, we 
started from the true partition; for iKim et a/] l l2008l) . we started 
from the true values of parameters. Alternatively, initialization 
procedures have been considered for mixtures of regressi on models 
with and without random effects l lScharl et ali |2010|) . For the 
comparison, we consider the misclas sified error rate, th e Ran d 
Index, and the adjusted Rand Index l lHubert and ArabieL Il985h . 
where the latter two assess the degree of agreement between the 
partition and the true clusters of genes. A larger (adjusted) Rand 
Index indicates a higher level of agreement. 

Specifically, we first investigate the performance of ou r new 
extended EMMIX-WIRE model and that of lKim et al\ | |2008|) when 
the data are generated from the extended EMMIX-WIRE model, in 
which gene expressions within a cluster are correlated. As listed in 
Tables 1 and 2, the estimates of the parameters p, ao,ai,bi,6^ , p, 
and in the proposed model are approximately unbiased, except 
fo r cP, which is sli ghtly underestimated. In contrast, the method 
of iKim et 'ai\ ( l2008h fails to capture the contributions from gene- 
specific and tissue-specific effects on the auto-correlation among 
periodic gene expressions at each time point, and thus overestimates 
the correlation between different time points for each gene. Their 
method therefore leads to an inferior clustering performance in 
terms of higher error rates and smaller Rand Indices. 

We now compare our model with Kim et al. I 1I2OO8I) using the 
data from the extended model of lOin and Selj ( l2o6i7 , which is a 



special case of our EMMIX-WIRE model (with eP = 0), where gene 
expressions are independent. The results are presented in Tables 3 
and 4. As we explained in the last paragraph, the system errors are 
removed in this situation. And our model has unbiased estimation 



for all parameters. On the other hand, the model o f lKimef a/.|j2008h 
still overestimates the residual variance at different time points and 
underestimates the correlation between different time points for 
each gene, as it fails to capture the contribution from gene-specific 
effects to the auto-correlation among periodic gene expressions at 
each time point. Their method again produces larger error rates and 
slightly smaller Rand Indices. 

Lastly, we generate the data from the model of lKim et al\ ( l2008h 
and provide comparative results in Tables 5 and 6. It is observed 
from Tables 5 and 6 that the clustering performances are comparable 
between the two models. 

Our model again provi des unbiased estim ates for all parameters. 
In contrast to the model o f lKim et al\ j2008l) . our model accounts for 
the correlation among gene profiles via the linear effects modelling. 
As presented in T ables 1 to 6, our model outperforms the model of 
iKim et al. I1 I2OO8I) when the genetic profiles are correlated. When the 
genetic profiles are generated independently, our model has better 
performance in cases where the variability in gene expressions at 
each time point is large. In cases where the residual covariance 
structure follows an AR(1) model (Kim et al., 2008), our model still 
provides compara tive results and unbiased estimates as the model of 
iKim et al. 1 1 I2OO8I) . The advantage of our model is to provide more 
reliable and robust clustering of time-course data is apparent. With 
microarray experiments including those time-course studies, gene 
expression levels measured from the same tissue sample (or time 
point) are correlated (McLachlan et al., 20(3), clustering methods 
which ass ume independentl y distributed gene profiles, such as the 
model of I Kim et al\ ( l2008h . may overlook important sources of 
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Table 5. Bias and sta ndard deviation in brackets from 1000 simulated data 
points (generated from lKim et al\ ilOOSh with 0^ equal to 0.5) 



First component Second component Third component 



Parameters 


EM-W 


Kim 


EM-W 


Kim 


EM-W 


Kim 


p(0.585, 


-0.003 


0.000 


-0.008 


0.001 


0.010 


-0.000 


0.1,0.315) 


(0.004) 


(0.003) 


(0.023) 


(0.003) 


(0.024) 


(0.004) 


ao(0.3. 


0.002 


0.000 


0.003 


0.001 


0.001 


0.001 


1,0.2) 


(0.013) 


(0.013) 


(0.010) 


(0.010) 


(0.010) 


(0.010) 


ai(0.03. 


0.015 


0.001 


-0.236 


-0.002 


0.047 


0.003 


1,0.02) 


(0.041) 


(0.036) 


(0.333) 


(0.037) 


(0.073) 


(0.035) 




A ni /I 

U.U14 


-u.uuu 


-U.jUo 


n Am 


U.UJo 


U.UUl 


0.9,0.01) 


(0.026) 


(0.021) 


(0.345) 


(0.023) 


(0.067) 


(0.025) 


6(^(0.5, 


-0.034 


-0.000 


-0.006 


-0.001 


-0.021 


-0.000 


0.5,0.5) 


(0.036) 


(0.006) 


(0.027) 


(0.015) 


(0.025) 


(0.009) 


p(0.6 


0.020 


-0.000 


0.013 


-0.001 


0.023 


-0.001 


0.6,0.6) 


(0.021) 


(0.007) 


(0.025) 


(0.017) 


(0.028) 


(0.009) 


0-^(0.0, 


0.025 




0.014 




0.022 




0.0,0.0) 


(0.026) 




(0.015) 




(0.023) 






0.000 




0.045 




0.042 




0,0) 


(0.000) 




(0.095) 




(0.056) 








EM-W 






Kim 




EiTor rate 




0.018 






0.016 




Rand 




0.978 






0.980 




Adjusted 




0.955 






0.960 





Table 6. Bias and sta ndard deviation in brackets from 1000 simulated data 
points (generated from lKim et al\ fioOSi ) with 0^ equal to 1.3) 



First component Second component Third component 



Parameters 


EM-W 


Kim 


EM-W 


Kim 


EM-W 


Kim 


p{0.5&5. 


-0.009 


0.001 


-0.007 


0.005 


0.016 


-0.001 


0.1,0.315) 


(0.013) 


(0.010) 


(0.012) 


(0.011) 


(0.020) 


(0.013) 


ao(0.3. 


-0.002 


-0.000 


0.015 


0.001 


0.003 


-0.000 


1,0.2) 


(0.023) 


(0.023) 


(0.024) 


(0.019) 


(0.016) 


(0.016) 


ai(0.03. 


-0.005 


-0.001 


0.054 


-0.000 


0.003 


0.000 


1,0.02) 


(0.071) 


(0.074) 


(0.0928) 


(0.083) 


(0.068) 


(0.064) 


6i(0.06, 


0.015 


-0.000 


-0.131 


0.001 


0.020 


0.000 


0.9,0.01) 


(0.036) 


(0.036) 


(0.135) 


(0.045) 


(0.041) 


(0.043) 


0^(1.3, 


-0.195 


-0.000 


-0.185 


-0.003 


-0.186 


-0.002 


1.3,1.3) 


(0.196) 


(0.016) 


(0.192) 


(0.049) 


(0.189) 


(0.025) 


p(0.6 


0.043 


-0.000 


0.037 


-0.002 


0.044 


-0.001 


0.6,0.6) 


(0.043) 


(0.007) 


(0.042) 


(0.022) 


(0.045) 


(0.010) 


(7^(0.0. 


0.144 




0.131 




0.143 




0.0,0.0) 


(0.145) 




(0.133) 




(0.144) 




d2(0.. 


0.000 




0.000 




0.001 




0.,0.) 


(0.000) 




(0.001) 




(0.001) 








EM-W 






Kim 




Error rate 




0.101 






0.102 




Rand 




0.864 






0.866 




Adjusted 




0.725 






0.729 





variability in the experiments, resultin g in the consequ ent possibility 
of misleading inferences being made jNg et 







Fig. 1. Clustering of gene expression profiles into four groups for the yeast 
dataset 1. 



3.2 Applications: Yeast cell cycle datasets 

3.2.1 Yeast cell cycle dataset 1 The first data is the yeast cell 
cycle data with MIPS criterion fro m Wong et at ] ( l2007t) . This data 
set is extracted from ICho et al\ ( 1200 ih and made available by 
lYeung et a/] ( l200l[) . The yeast cell cycle dataset contains 237 genes 
and 17 samples. These genes corresponding to four categories in 
the MIPS database (DNA synthesis and replication, organization 
of centrosome, nitrogen, and sulphur metabolism, and ribosomal 
proteins); these are assumed to be the true clusters. In this 
illustratio n, we fit ou r new e xtended EMMIX-WIRE model and the 
model of iKim et al\ ( l2008h to the y east c ell cycle data, with the 
period of 85 in the F ourier extensi on i Luan a nd Li, 2004). 

In the Table 2 of IWong et al\ ( l2007h , it shows that the Rand 
and adjusted Rand Indices for their two-stage method are 0.7087 
and 0.3697, respectively, and these indices are higher than other 
methods considered in their paper. Using the model of iKim et al\ 
ilOOA the Rand indices are 0. 7330 and 0.472 1, respectively. With 
the model of EMMIX-WIRE jNg et all l2006h . we have the Rand 
and adjusted Rand Indices 0.7799 and 0.5568, respectively. Using 
the proposed new model, the Rand and adjusted Rand Indices 
are 0.8123 and 0.6189, respectively, and are the best matches 
(the largest index) compared with the aforementioned models. The 
four clusters of genes time-course profiles are presented in Figure 
1. It can be seen that the genes have very similar expression 
patterns within each cluster, except in cluster 2, where there is 
greater individual variation by some of the genes. The estimation 
using the proposed model is listed in Table 7. It can be seen that 
the correlations in the first three components are from 0.27 to 
0.72, indicating a significant correlation among gene expressions 
at different time points. Ignoring this correlation may therefore lead 
to a lower Rand Index, that is, a worse clustering. We can see the 
estimates of (P in clusters I and 4 are large and are greater than 
the corresponding estimates of 6^, indicating co-regulation in these 
two clusters. If we ignore such within-cluster co-regulation, we will 
have Rand Indices similar to those of , Kim et all (200^. Our model 
considers both autocorrelation and co-regulation, and thus obtains 
the best clustering performance. 
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Table 7. Estimations of parameters for tlie yeast cell cycle dataset 1 
(237 genes) 





first cluster 


second cluster 


third cluster 


fourth cluster 


p 


0.104 


0.054 


0.118 


0.724 




-0.107 


0.400 


-0.807 


0.298 


bi 


1.009 


-0.119 


-0.053 


0.079 




0.027 


0.011 


0.025 


0.278 


02 


0.174 


0.417 


0.443 


0.307 


P 


0.278 


0.717 


0.435 


0.053 


cP 


0.191 


0.001 


0.031 


0.310 




85 


85 


85 


85 





50 100 150 





50 100 150 




50 100 150 



Fig. 2. Clustering of gene expression profiles into five groups for the yeast 
dataset 2. 



Table 8. Estimations of parameters for the yeast cell cycle dataset 2 (384 genes) 





first cluster 


second cluster 


third cluster 


fourth cluster 


fifth cluster 


p 


0.238 


0.290 


0.151 


0.165 


0.157 


ai 


0.643 


-0.061 


-0.736 


-0.616 


0.329 


bi 


-0.062 


1.019 


0.285 


-0.772 


-1.001 




0.011 


0.046 


0.037 


0.028 


0.006 




0.498 


0.296 


0.470 


0.309 


0.244 


p 


0.503 


0.269 


0.364 


0.379 


0.550 




0.062 


0.052 


0.044 


0.065 


0.030 




85 


85 


85 


85 


85 



3.2.2 Yeast cell cycle dataset 2 The seco nd example is the subset 
of 384 genes from the yeast cell cycle data dCho et fl/ll200lh while 
the full data set can be found from the Stanford yeast cell cycle web 
site( http: / /171 . 65 . 26 . 52/yeast_cell_cycle/ 
cellcycle.html). 

Each of gene is assigned a "phase". We call each "phase" a "Main 
Group". There are five "Main Groups" in this dataset, namely, early 
Gl, late Gl, S, G2 and M. We now compare and assess the cluster 
quality with the external criterion (the 5 phases). The raw data is log 
transformed and normalized by columns and rows. Figure 2 presents 
the five clusters of genes profiles obtained using the proposed model. 
It can be seen that the genes have very similar expression patterns 
within each cluster. The estimations are listed in Table 8. The Rand 
and adjusted Rand Indices are 0.8102 a nd 0.4484, respect ively. They 
are 0.8108 and 0.4592 for the model of lKim et al\ l l2008l) . The error 
rates are the same (0.2813) for the two models. The performances 
of the two models are very similar because the correlation among 
gene profiles is weak in this dataset. As indicated in Table 8, the 
estimates of <f are all very small compared to the estimates of 6^ . 



4 DISCUSSION 

We have presented a new mixture model with AR(1) random 
effects for the clustering of time-course gene expression profiles. 
Our new model involves three elements taking important role 
in modelling time-course periodic expression data, namely, (a) 
Fourier expansion which models the periodic patterns; (b) auto- 
correlation variance structure that accounts for the auto-correlation 
among the observations at different time points; and (c) the cluster- 
specific random effects which incorporate the co-regulation within 
the clusters. In particular, the latter two elements corresponding to 
the correlations between time-points and between genes are crucial 
for reliable and accurate clustering of time-course data. We have 
demonstrated in the simulation and real examples that the accuracy 
of clustering is improved if the auto-correlation among the time 
dependent gene expression profiles has been a ccounted for alon g 
the time points; this is also demonstrated in I Kim et al ] l l2008h . 
Furthermore, better results are obtained if the co-regulation within 
the clusters is modelled appropriately. When the correlation between 
genetic profiles is not small, which is the case for typical time- 
course data, ignorance of this dependency may lead to less accurate 
clustering results. 

For the purpose of comparison, the periods of the signal of gene 
expression are assumed to be known in the simulation study and 
applications to real data. In practi ce, there are several ways to 
estimate the pe riods fo r each c l uster I KimefaZ . l2008l ; lLuan and lH 
2004; Spe llmaii et all 1 19981 ; iNgef a/l l2o'o6l) . For example, in 
iKim et al. i 120081) , the periods are estimated using simplex algorithm 
at the M-step during the EM algorithm. However, when the periods 
are estimated during the EM iterations, we find that the periods 
depend also on other parameters. In addition, when we start from 
an initial period and get the design matrix X, then with higher 
possibility the best period will be the initial periods. So we change 
the strategy to a slow one, and we call it global grid search method, 
which guarantees the highest maximum log likelihood at the best 
periods. It performs as follow, let S is a set with its element as 
(period u)\, period lj2, ■ ■ • , period uig), where oj;, can take all 
possible values (grid points). For example, for the yeast cell cycle 
data, the possible periods are 60, 61, . . . , 90. Then for each fixed 
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(ljJi, u!2,. ■ ■ , ujg), we estimate the parameters as if tlie periods for 
eacfi component are known. Finally we compare tfie log likelifiood 
and clroose tfie one witlr tire highest log likelihood as the final result. 
Since it is very slow if there are too many elements in set S when 
we have no prior information about periods, we recommend use 
other method to get the periods first iBooth et all |200^ . In all the 
calculation in this paper, we assume the period is fixed, that is, there 
is only one element in the set S. 

The proposed model is very flexible through the different 
specificatio ns of des i gn ma trices or model options as originally 
available in iNg et al\ OOoi) . For exampl e, besides the full m odel, 
it enables us to incorporate the model of lOin arid SeS bOOd) as a 
special case. Specifically, we can obtain their model by assuming 
zero cluster effects (v = 0) and that random effects u be auto 
correlated for each gene. Furthermore, when both random effects 
u and V are assumed to be zero, then we have normal mixture 
of regression models. In the program we have developed, there 
are many options and parameters for users to specify the models 
they want to use in addition to the models we list in our paper. 
The program is written in R package and is available from the 
corresponding author. 
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